date()
## [1] "Wed Dec 02 21:07:23 2020"
The text continues here.
What to expect for this course? The truth is, I don’t know. I have some previous background in programming, mostly with Python, but I have never studied data analysis in any way.
For this course, I expect to learn data analytics work flow and R programming.
Link to my GitHub repository for this course: https://github.com/JariRintaaho/IODS-project
date()
## [1] "Wed Dec 02 21:07:24 2020"
#The data is located in a server. Easyest way to get it, is to use read.csv(url)
url="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
learning2014=read.csv(url)
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The dataset learning2014 consist 166 rows and 7 variables. The structure of the data is shown above. The dataset is handled using data.frame structures in R. The data is related to International survey o Approaches to Learning by Kimmo Vehkalahti.
# ggplot2 is an excellent library for ploting in R. With 3 lines, it is possible to plot practically everything you have ever wanted to see from a datafile.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggpairs(learning2014, title="Practically everything you want to see from a data set", ggplot2::aes(colour=gender), method = c("everything", "pearson"))
## Warning in warn_if_args_exist(list(...)): Extra arguments: "method" are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the ggpairs-plot one can see that all the variables, except points, are more or less normally distributed. The gender does not effect dramatically for points. The distributions for males and females are very similar.
# selected explanatory variables for the regression model are: attitude, surf and stra. The target variable is points.
model = lm(points ~ attitude + surf + stra, data=learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + surf + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# The model seems to fit to the dataset with sufficient accuracy.
The plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
par(mfrow = c(2,2))
plot(model, which=c(1,2,5))
The linear model seems to work OK.
date()
## [1] "Wed Dec 02 21:07:29 2020"
url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
data=read.csv(url, sep=",")
The glimpse is much nicer that the summary(). At the moment, we are not interested in statistical parameters.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(data)
## Rows: 382
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
The data seems to have 382 rows and 35 columns. There are several numerical columns, several categorical columns and even one logical column. The data is somehow related to students’ alcohol consumption.
The aim of this task is to analyze alcohol consumption level (logical variable high_use) using four parameters. The interesting parameters are freetime, famrel, goout and absences. It think it is quite obvious that those students how has lot of free time and go out often use more alcohol. According to scientific consensus, family relations affect alcohol consumption. Based on personal experience related to alcohol consumption and studies, high level of alcohol consumption causes absence.
First, let’s draw four histograms. One for each variable
# par makes multiple plots look nicer :)
par(mfrow=c(2,2))
hist(data$freetime)
hist(data$famrel)
hist(data$goout)
hist(data$absences)
It seem that freetime and goout are distributed in more or less Gaussian way. Most of the studest have free time and their going out is average.
The average family relation is skewed. Most of the students have family relation 3 to 5. However, there are still some, whose family relation is 1 or 2.
The absence is also skewed. Nearly all of the students have absence less than 20. However, there seems to be at least one, whose absence is larger than 50.
Let’s then see, how the box plots look like
# ggplot 2 has some nice plot styles.
library(ggplot2)
# par makes multiple plots look nicer :)
par(mfrow=c(2,2))
g1 = ggplot(data, aes(x = high_use, y = freetime))
g1 = g1 + ggtitle("Student freetime by alcohol consumption")
g1 + geom_boxplot() + ylab("freetime")
g2 = ggplot(data, aes(x = high_use, y = famrel))
g2 = g2 + ggtitle("Student famrel by alcohol consumption")
g2 + geom_boxplot() + ylab("famrel")
g3 = ggplot(data, aes(x = high_use, y = goout))
g3 = g3 + ggtitle("Student goout by alcohol consumption")
g3 + geom_boxplot() + ylab("goout")
g4 = ggplot(data, aes(x = high_use, y = absences))
g4 = g4 + ggtitle("Student absences by alcohol consumption")
g4 + geom_boxplot() + ylab("absences")
Based on the findings (the box plots), free time seems to have no effect on alcohol consumption. In both groups (high users and low users) the amouth of free time is identical.
Family relations seems to have effect on alcohol consumption. However, with this parameter there are two outlier points. They are not high users, but they have low scores in family relations.
It seems that going out is linked to being a heavy user of alcohol. Thus, there is one outlier point, who goes out often, but doesn’t use much alcohol.
Absence and being a high user seems to go hand in hand. But, not always. There are lots of outlier points. There are several students, who have plenty of absence, but they are not high users.
Based on the findings, low familiy relations and going out often seems to have strong relations for being a high user of alcohol. Also absence has some explanatory value. Opposite to original hypothesis, the amount of free times seems not to affect on alcohol consumption.
The logistic regression model can be built using a single command
m <- glm(high_use ~ freetime + famrel +goout + absences,
data = data, family = "binomial")
The model summary. The Pr values for each variable shows clear difference. The smaller the Pr value of a parameter is the better explainer of the parameter is. The rule of thumb is, that if Pr < 0.05, it has value for the model. In this case, goout can be seen as an excellent parameter. Both absences and famrel have some value for the model for explaing the high use of alcohol. The free time seems to have no use for the model.
summary(m)
##
## Call:
## glm(formula = high_use ~ freetime + famrel + goout + absences,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9033 -0.7761 -0.5096 0.8915 2.3925
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.71107 0.68985 -3.930 8.50e-05 ***
## freetime 0.22413 0.13868 1.616 0.106054
## famrel -0.43185 0.13766 -3.137 0.001707 **
## goout 0.73875 0.12547 5.888 3.91e-09 ***
## absences 0.05893 0.01611 3.658 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 385.67 on 377 degrees of freedom
## AIC: 395.67
##
## Number of Fisher Scoring iterations: 4
The coefficients odds rations and their intervals. The coefficients does not tell much for anyone. But the odds rations and especially their intervals say. If the odds ratio of a parameter is exactly 1, the parameter has no explanatory value for the model. For odds ratio < 1, the parameter has negative correlation and for odds ratio > 1, the correlation is positive.
As one clearly sees, the 95 % confidence boundary for freetime parameter crosses the critical value 1. Therefore, one cannot say, if increased freetime cause high use of alcohol or not.
As expected, family relations have negative correlation. Those studets with high family relations are less likely to be high users of alcohol.
Those students, who go out often, are high users of alcohol, with no doupt. However, the positive correlation between being high user and having lots of absence is only small.
coef(m)
## (Intercept) freetime famrel goout absences
## -2.71107006 0.22413401 -0.43185038 0.73875213 0.05893415
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI = confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.06646565 0.01651175 0.2485937
## freetime 1.25123868 0.95490180 1.6469720
## famrel 0.64930652 0.49374170 0.8486899
## goout 2.09332169 1.64772445 2.6976795
## absences 1.06070539 1.02929793 1.0978106
The predictions for the model can be calculate using predict() function. The predicted values are stored to the same data frame as the actual data is.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
data <- mutate(data, probability = probabilities)
# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = (probability > 0.5))
The requested 2 x 2 cross tabulation of predictions versus the actual values. It can easily see, that the model prediction errors are not symmetrical. The tends to miss most of the high users and give some false positive predictions.
# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 18
## TRUE 68 44
date()
## [1] "Wed Dec 02 21:07:30 2020"
As you can see, this file have been made and uploaded to GitHub.
# In order to keep the Rmd file clean, all the libraries are downloaded here. Keep scrolling.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.4 v purrr 0.3.4
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
The data to be analyze this week is called Boston. It comes from MASS package. The data is rather classic set of information. It consist data related to housing values in suburbs of Boston and several other parameters including air quality, crime rate and proportion of people of color at those suburbs. The dataset is rather old. It is first published in 1978.
data("Boston")
The str() method is called for checking the basic structure of the data. The data has 506 rows and 14 columns. The variable names and types are listed here:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Here is the pairs-plot of the dataset. Since the data has quite many columns, it practically shows you nothing.
pairs(Boston)
It is better to use corrplot()-method for checking the data set. The tidyverse and corrplot packages were downloaded for this.
The correlations between different parameters are shown here. The larger and darker the circle is, the stronger the correlation is. Blue represents positive and red negative correlation.
cor_matrix<-cor(Boston) %>% round(digits=2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The histograms for each variable can be drawn using ggplot2. Traditionally, the data set is used for demonstrating effects of different parameters to housing values.
ggplot(gather(Boston), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x')
In this task, the data set “Boston” is standardized and crime rate is switched to categorical variable.
As one can see, the mean value for each parameter is zero after scaling.
## First, the data set is scaled and summary is printed
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Then the crime rate is switched to categorical variable.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
The final part of this task is to split the scaled data set into test and train data using 80-20 split.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
After the model is trained, the results it gives to the test set is less-biased estimator for the actual model accuracy.
First, the LDA-model is fitted to the data set. This time, the catergorical crime rate is used as target variable.
# These simple models can be applied using a single line of code.
lda.fit <- lda(crime ~ ., data = train)
A new function “lda-arrows” is defined for the task. The function draws the LDA bi-plot.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
One can see, that is actually relatively easy to separate the high-crime rate areas
First, the correct classes are separated from the test set
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
The rest of the test data is fed into the trained LDA model. The cross tabulated results are printed.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 17 10 4 0
## med_low 5 15 5 0
## med_high 1 8 16 1
## high 0 0 0 20
As the results show, the model can predict the high crime rate suburbs with nearly 100% accuracy. However, the model truly struggles when it tries to decide, should a suburb belong to low or med_low category. In addition, there is some problems to find the difference between med_low and med_high suburbs.
First, let’s reload and standardize the Boston data set
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)
Second, let’s build a k-mean clustering algorithm with the scaled boston data.
Let’s select 2 as the number of clusters. Since 14x14 pair plot is bit difficult to look at. Let’s draw three smaller ones.
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <-kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled[1:5], col = km$cluster)
pairs(boston_scaled[5:10], col = km$cluster)
pairs(boston_scaled[10:14], col = km$cluster)
The pair plots shows nice separation of two groups. The results are quite simillar as with the LDA algorithm. With LDA we saw that it is relatively easy to find the suburbs with high crime rate. But there was significant overlapping with suburbs that had low, med_low and med_high crime rate.
First, let’s perform the k-means clustering using 3 centers.
km2 <-kmeans(boston_scaled, centers = 3)
Then, let’s try to fit the LDA to the new k-means model output.
# These simple models can be applied using a single line of code.
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled)
Now, let’s look the results. It seems that the model can predict detect the suburbs with high crime rate using this unsupervised learning method.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km2$cluster)
# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
First part of the super bonus task
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
This first chunk is just for downloading data and libraries.
date()
## [1] "Wed Dec 02 21:07:37 2020"
url="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt"
human=read.csv(url)
library(corrplot)
library(tidyverse)
library(ggplot2)
The data combines several indicators from most countries in the world
Health and knowledge
Empowerment
“Parli.F” = Percetange of female representatives in parliament
“Edu2.F” = Proportion of females with at least secondary education
“Edu2.M” = Proportion of males with at least secondary education
“Labo.F” = Proportion of females in the labour force
“Labo.M” " Proportion of males in the labour force
“Edu2.FM” = Edu2.F / Edu2.M
“Labo.FM” = Labo2.F / Labo2.M
Summary of each individual variable:
print(summary(human))
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Histograms of different variables in the data set.
human %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Corrplot is nice way to inspect linear correlation coefficients between variables in a dataset.
cor_matrix<-cor(human) %>% round(digits=2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
There are several parameters, that seems to correlate rather nicely between each other. For example it is rather obvious, that high life expectancy correlates negatively with maternal mortality rate. One thing was not so obvious, but still interesting. Proportion of females in the labour force does not correlate with anything.
PCA model and the biplot for non-standardized data.
## PCA model and biplot
pca_human <- prcomp(human)
s <- summary(pca_human)
# rounded percetanges of variance captured by each PC
pca_pr <- round(1*s$importance[2, ]*100, digits = 1)
# create object pc_lab to be used as axis labels
pc_lab=paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.5, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
It can be clearly seen, that GNI is only explaining variable for non-standardized data.
Same model, but for scaled data.
human_std <- scale(human)
## PCA model and biplot
pca_human <- prcomp(human_std)
s <- summary(pca_human)
# rounded percetanges of variance captured by each PC
pca_pr <- round(1*s$importance[2, ]*100, digits = 1)
# create object pc_lab to be used as axis labels
pc_lab=paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.5, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
From the data, three different groups of variable can be detected. The first group: In general, Proportion of females in the labour force (“Labo.F”) and Percetange of female representatives in parliament (“Parli.F”) seems to correlate positively. The explanation is more or less obvious. If females participate labour force, the they also participate politics.
Second group consists two variables: Maternal mortality ratio (“Mat.Mor”) and Adolescent birth rate (“Ado.Birth”). This is also rather obvious positive correlation. Maternal mortality ratio is often correlated with income level.
The third group consists last four parameters: “GNI” = Gross National Income per capita, “Life.Exp” = Life expectancy at birth, “Edu.Exp” = Expected years of schooling and Edu2.FM = Proportion of females with at least secondary education / Proportion of males with at least secondary education. These third group parameters correlate negatively with the second group.
First parameter is a linear combination of six parameters: GNI, Life.Exp, Edu.Exp, Edu2.FM, Mat.Mor and Ado.Birth. This parameter could be called for example “Economical index”.
Second PC parameter is a linear combination of Labo.FM and Parli.F. This parameter can be called for example “Equality index”
Let’s first download the data and look the structure
library(FactoMineR)
data(tea)
print(str(tea))
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## NULL
The dataset has 300 rows and 36 variables. For shake of simplicity, let’s narrow down the data set by removing most of the variables
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- subset(tea, select=keep_columns)
print(summary(tea_time))
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
Let’s visualize the data set.
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## List of 1
## $ axis.text.x:List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : num 8
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : num 45
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi FALSE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi FALSE
## - attr(*, "validate")= logi TRUE
Data shows that: People prefer Earl Grey tea and has no lear opinion about sugar.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
# visualize MCA
plot(mca, invisible=c("ind"))
Data clearly shows, that we have a strong hipster cluster. The people, that drinks unpacked tea buys that from tea shops.
This first chunk is just for libraries.
date()
## [1] "Wed Dec 02 21:07:40 2020"
library(corrplot)
library(tidyverse)
library(ggplot2)
library(tidyr)
library(dplyr)
First, let’s load RATS data set and look its structure
url_RATS="https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt"
RATS=read.csv(url_RATS, sep="\t")
RATS$ID=as.factor(RATS$ID)
RATS$Group=as.factor(RATS$Group)
glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1 <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4...
## $ WD8 <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 4...
## $ WD15 <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 4...
## $ WD22 <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 4...
## $ WD29 <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 4...
## $ WD36 <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 4...
## $ WD43 <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 4...
## $ WD44 <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 5...
## $ WD50 <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 5...
## $ WD57 <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 5...
## $ WD64 <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 5...
head(RATS)
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 1 260 265 270 275 275 277 278 278 284 279 281
tail(RATS)
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 11 11 2 445 445 450 452 455 455 451 450 462 466 472
## 12 12 2 555 560 565 580 590 597 595 595 612 618 628
## 13 13 3 470 465 475 485 487 493 493 504 507 518 525
## 14 14 3 535 525 530 533 535 540 525 530 543 544 559
## 15 15 3 520 525 530 540 543 546 538 544 553 555 548
## 16 16 3 510 510 520 515 530 538 535 542 550 553 569
The RATS file has 16 time series with 11 time series each. The time serieses can be spitted ont three different groups.
Now, let’s convert the RATS into “long format”
RATSL <- RATS %>%
gather(key = WD, value = Weight, -ID, -Group) %>%
mutate(Time = as.integer(substr(WD,3,4)))
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
head(RATSL)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
tail(RATSL)
## ID Group WD Weight Time
## 171 11 2 WD64 472 64
## 172 12 2 WD64 628 64
## 173 13 3 WD64 525 64
## 174 14 3 WD64 559 64
## 175 15 3 WD64 548 64
## 176 16 3 WD64 569 64
Let’s draw each individual time series.
p1 <- ggplot(RATSL, aes(x = Time, y = Weight, group = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
p6
It can be seen, that group 1 is clearly separated data set. Group 2 has one clear out lier data series. Groups 2 and 3 seems not to differ much.
Let’s standardize the data sets. And then plot them.
# Standardise the scores:
RATSL <- RATSL %>%
group_by(Time) %>%
mutate( stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, ...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 55...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881900...
p1 <- ggplot(RATSL, aes(x = Time, y = stdWeight, group = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized Weight")
p6
When standardized the limes looks more or less the same as without standardization.
Let’s then look the box plots.
p1 <- ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + theme(legend.position = c(0.8,0.8))
p5 <- p4 + scale_x_discrete(name = "week")
# Black & White version:
#p6 <- p5 + scale_fill_grey(start = 0.5, end = 1)
p5
With the box plots, the difference between groups 2 and 3 comes clear. However, each of the groups have clearsly outliers.
Let’s look the mean values for each group.
RATSL8S <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
p1 <- ggplot(RATSL8S, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
## Warning: `fun.y` is deprecated. Use `fun` instead.
p5 <- p4 + scale_y_continuous(name = "mean(Weight)")
p5
The most obvious outlier is for Group 2. Let’s remove it.
# Remove the outlier:
RATSL8S1 <- RATSL8S %>%
filter(mean < 550)
p1 <- ggplot(RATSL8S1, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
## Warning: `fun.y` is deprecated. Use `fun` instead.
p5 <- p4 + scale_y_continuous(name = "mean(Weight), weeks 1-8")
p5
Now, let’s test, how good linear fitting we can do?
# Add the baseline from the original data as a new variable to the summary data:
baseline <- RATS$WD1
RATSL8S2 <- RATSL8S %>%
mutate(baseline)
glimpse(RATSL8S2)
## Rows: 16
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273...
## $ baseline <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555...
# Fit the ANCOVA model and see the results:
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.732 -3.812 1.991 6.889 13.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.14886 19.88779 1.516 0.1554
## baseline 0.93194 0.07793 11.959 5.02e-08 ***
## Group2 31.68866 17.11189 1.852 0.0888 .
## Group3 21.52296 21.13931 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9933
## F-statistic: 747.8 on 3 and 12 DF, p-value: 6.636e-14
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that we can separate the different groups even when the baseline is taken into account.